Whenever an engineer publishes their work, the manuscript should be accompanied by the dataset and the software used to produce the results presented in it. This workflow is still a challenge for many researchers in the engineering sciences, for two reasons. The first reason is that, especially in bespoke experiments (TA ALEX), researchers produce large amounts of data that are difficult to make available in public or institutional repositories. The second reason is that the notion of open science and transparent publishing is a relative novelty in the engineering sciences. Below, two engineering science researchers report on their experience with research data management practices in the process of publishing their work and formulate how-tos as well as lessons learned.
Measures:
Manuscripts in engineering sciences are often like a ball of wool yarn: they do rely on data and code, but these are “wrapped up into the ball” and not transparent to the reader or the reviewer. By unwinding the ball and making the necessary datasets and software findable and accessible, the transparency of the manuscript can be ensured.
In engineering sciences, and especially in use cases related to the NFDI4ing archetype ALEX, the following entities are important:
- the manuscript itself,
- the dataset(s) underlying the presented results, and
- the research software (code) used to produce them.
Ideally, all of these entities are made findable (by assigning PIDs, i.e. persistent identifiers such as a DOI) and accessible (by storing them with open access).
Below, we will describe how we attempted to follow this concept when writing a publication.
This use case is based on the publication
Logan, K.T.; Leštáková, M.; Thiessen, N.; Engels, J.I.; Pelz, P.F. Water Distribution in a Socio-Technical System: Resilience Assessment for Critical Events Causing Demand Relocation. Water 2021, 13, 2062. https://doi.org/10.3390/w13152062
Here is its shortened abstract:
This study presents an exploratory, historically-informed approach to assessing resilience for critical events that cause demand relocation within a water distribution system (WDS). Considering WDS as an interdependent socio-technical system, demand relocation is regarded as a critical factor that can affect resilience similarly to the more commonly analyzed component failures such as pipe leaks and pump failures. Critical events are modeled as events during which consumer nodes are evacuated within a perimeter varying in size according to a typical length scale in the studied network. The required demand drops to zero in the evacuated area, and the equivalent demand is relocated according to three sheltering schemes. Results are presented for analyzing the effect of the size of the evacuated area, the feasibility of sheltering schemes, vulnerability of particular parts of the city as well as the suitability of network nodes to accommodate relocated demand using a suitable resilience metric. The results provided by this metric are compared with those drawn from common graph-based metrics. The conclusions are critically discussed under the consideration of historical knowledge to serve as a basis for future research to refine resilience assessment of socio-technical systems.
The study was performed as a virtual experiment using self-written code in the programming language Python, with the Python package WNTR used for hydraulic simulation. The following steps were performed:
- setting up the critical-event scenarios and running the hydraulic simulations with WNTR,
- storing the raw results together with their setup parameters (as metadata) in HDF5 files,
- querying the stored results and performing the statistical analysis,
- collaborating on the code via GitLab, and
- publishing the manuscript, the data and the code.
Below, the approach is described in more detail.
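To give an impression of the simulation step, here is a minimal sketch of how a hydraulic simulation can be run with WNTR; the network file name is a placeholder and not the actual network used in the study.

```python
# Minimal WNTR sketch (illustrative only): "network.inp" is a placeholder
# EPANET input file, not the network studied in the publication.
import wntr

# Load a water distribution network model from an EPANET .inp file
wn = wntr.network.WaterNetworkModel("network.inp")

# Run the hydraulic simulation
sim = wntr.sim.EpanetSimulator(wn)
results = sim.run_sim()

# Node pressures over the simulation horizon, returned as a pandas DataFrame
pressure = results.node["pressure"]
print(pressure.head())
```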
According to the HDF Group, the Hierarchical Data Format (HDF5) can be used “to manage, process, and store your heterogeneous data” and “is built for fast I/O processing and storage.” It allows you to store your data along with metadata in an easy-to-understand hierarchical structure.
For Python users, there are two libraries that provide an interface to the HDF5 format:
- h5py, and
- PyTables (which also underlies the HDF5 functionality of pandas).
For viewing the files, the HDF Group developed the tool HDFView.
Both of these packages provide useful tools to store datasets in HDF5 files and read them back out, as well as to attach metadata to the datasets.
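As a minimal sketch of this storage pattern with h5py (the file name, group names and attribute keys below are hypothetical placeholders, not the actual structure of our result files):

```python
# Hypothetical example: store one simulation result with its setup
# parameters as HDF5 attributes, then read both back out.
import h5py
import numpy as np

pressure = np.random.rand(24, 90)  # dummy result: time steps x nodes

with h5py.File("results.h5", "a") as f:
    grp = f.require_group("scenario_001")
    dset = grp.create_dataset("pressure", data=pressure)
    # setup parameters stored as metadata on the dataset
    dset.attrs["evacuation_radius_m"] = 500
    dset.attrs["sheltering_scheme"] = "uniform"

with h5py.File("results.h5", "r") as f:
    dset = f["scenario_001/pressure"]
    data = dset[()]                             # the stored array
    radius = dset.attrs["evacuation_radius_m"]  # the stored metadata
```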
This was our main motivation for grouping many simulation results together, which left us with huge HDF5 files (~400 GB). Each simulation result was equipped with metadata documenting the setup parameters used to create it. After some initial difficulty figuring out which of the two Python libraries is better suited for which purpose (each has advantages in slightly different cases, and they generally cannot be used simultaneously), we found a good way of storing the data.
In the next step, the analysis, our approach was to query the HDF5 files containing the raw results in order to perform statistical analysis on them. This proved to be incredibly time-consuming, given the size of the HDF5 files and the lack of good querying tools in both Python packages. The packages can query efficiently by group and dataset names, but unfortunately not by metadata.
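In practice, an attribute-based search therefore amounts to walking the whole file, for example with h5py's visititems (the attribute names below are again placeholders):

```python
# Hypothetical example: collect the paths of all datasets whose metadata
# matches a given setup parameter by visiting every object in the file.
import h5py

matches = []

def collect(name, obj):
    # visititems calls this for every group and dataset in the file
    if isinstance(obj, h5py.Dataset) and obj.attrs.get("sheltering_scheme") == "uniform":
        matches.append(name)

with h5py.File("results.h5", "r") as f:
    f.visititems(collect)

print(matches)  # e.g. ["scenario_001/pressure", ...]
```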
In retrospect, it would have been more reasonable to store the simulation results in the HDF5 files in smaller batches, and/or to create a MultiIndex pandas DataFrame in which the MultiIndex corresponds to the metadata and each row stores the path to the respective HDF5 group/dataset. Pandas has efficient tools for querying such a MultiIndex DataFrame. Unfortunately for us, this realization came too late and we had to stick with the slow querying via h5py.visititems.
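A minimal sketch of this index-table idea (the parameter names and HDF5 paths are illustrative assumptions, not our actual setup):

```python
# Hypothetical lookup table: the MultiIndex holds the setup parameters,
# the column holds the path of the corresponding HDF5 dataset.
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(250, "uniform"), (250, "nearest"), (500, "uniform")],
    names=["evacuation_radius_m", "sheltering_scheme"],
)
lookup = pd.DataFrame(
    {"hdf5_path": ["scenario_001/pressure",
                   "scenario_002/pressure",
                   "scenario_003/pressure"]},
    index=index,
)

# Query by metadata via the MultiIndex instead of walking the HDF5 file
paths = lookup.xs("uniform", level="sheltering_scheme")["hdf5_path"]
print(paths.tolist())
```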
Collaborating via GitLab and following the GitHub flow was of great help to us. We developed the code collaboratively and avoided conflicts and bugs that could otherwise have occurred. Moreover, it was very helpful when we ran the analysis on other machines, as it allowed us to clone the code quickly and without problems.
Our original intention was to publish the data in our institutional repository, TUdatalib. The size of our files proved to be a problem, as the repository only allows uploads that are several orders of magnitude smaller. Hence, we started looking for other repositories, ideally ones that offer open access. A good database for finding data repositories, which also allows searching for those that accept open data uploads, is re3data.
Most data repositories have restrictions when it comes to uploading: for example, they only offer institutional access, require registration, or limit the file size. Zenodo, for instance, limits the file size to 50 GB, although it also offers exceptions upon request.
The code is stored as a GitLab project, which is not ideal, as it is not equipped with a PID. In the near future, we would like to archive it, along with the data, either in TUdatalib or in another repository.
Ideally, the manuscript, the code and the data should be cross-linked to ensure findability and transparency. We asked our journal (MDPI) whether it would be possible to retroactively add links to the code and data to the published manuscript, but the request was declined. This is something to consider for future publications and when choosing a suitable journal.
work in progress
This work has been funded by the LOEWE initiative (Hesse, Germany) within the emergenCITY center.